Stringent in-silico identification of putative G-protein-coupled receptors (GPCRs) of the entomopathogenic nematode Heterorhabditis bacteriophora

Abstract The infective juveniles (IJs) of entomopathogenic nematode (EPN) Heterorhabditis bacteriophora find and infect their host insects in heterogeneous soil ecosystems by sensing a universal host cue (CO2) or insect/plant-derived odorants, which bind to various sensory receptors, including G protein-coupled receptors (GPCRs). Nematode chemosensory GPCRs (NemChRs) bind to a diverse set of ligands, including odor molecules. However, there is a lack of information on the NemChRs in EPNs. Here we identified 21 GPCRs in the H. bacteriophora genome sequence in a triphasic manner, combining various transmembrane detectors and GPCR predictors based on different algorithms, and considering inherent properties of GPCRs. The pipeline was validated by reciprocal BLAST, InterProscan, GPCR-CA, and NCBI CDD search. Functional classification of predicted GPCRs using Pfam revealed the presence of four NemChRs. Additionally, GPCRs were classified into various families based on the reciprocal BLAST approach into a frizzled type, a secretin type, and 19 rhodopsin types of GPCRs. Gi/o is the most abundant kind of G-protein, having a coupling specificity to all the fetched GPCRs. As the 21 GPCRs identified are expected to play a crucial role in the host-seeking behavior, these might be targeted to develop novel insect-pest management strategies by tweaking EPN IJ behavior, or to design novel anthelminthic drugs. Our new and stringent GPCR detection pipeline may also be used to identify GPCRs from the genome sequence of other organisms.

Nematodes or roundworms are the most abundant metazoans on earth and are found in all conceivable habitats (Blumenthal and Davis, 2004; Bardgett and van der Putten, 2014). They are commonly classified as marine or terrestrial, free-living or animal/human/plant-parasitic (Lambshead, 2004). In order to perceive the environment and seek their hosts in soil or other habitats, nematodes depend on a very well-developed chemosensory mechanism (Curtis, 2008; Reynolds et al., 2011). The nematode Caenorhabditis elegans is a well-known model organism and has contributed tremendously to the understanding of biology (Riddle et al., 1997). Free-living nematodes such as C. elegans and most parasitic nematodes use their anterior and posterior chemosensory organs (known as amphids and phasmids, respectively) to orient themselves in response to chemosensory cues such as host exudates, food signals, sex pheromones, and other small molecules (Shivakumara et al., 2019). Entomopathogenic nematodes (EPNs) are a special class of nematodes that are pathogenic to insects, and are used worldwide for managing insect pests of crops (Koppenhöfer and Fuzy, 2006). Like other nematodes, EPNs depend upon chemoreception, olfaction, hygrosensation, thermoreception, and gustation to sense and understand their environment, to identify hosts, noxious compounds, and sexual partners, and to undertake alternative developmental pathways (Perry, 1996; Krieger and Breer, 1999; Prasad and Reed, 1999; McGaughran et al., 2013; Shivakumara et al., 2018). The EPN infective juveniles (IJs) use the universal host cue CO2 and odorants derived from hosts and from insect-damaged plants to locate their insect hosts in the heterogeneous soil ecosystem (Hallem et al., 2011; Dillman et al., 2012). The EPN H. bacteriophora is a successful biocontrol agent for several insect pests of crops (Poinar, 1975), and its small size, short lifespan, ease of culturing, hermaphroditism, and close evolutionary relationship with C. elegans and with mammalian-parasitic nematodes make it an ideal model for the study of insect pathogenesis and other biological studies (O'Halloran and Burnell, 2003; Hallem et al., 2007; Ratnappan et al., 2016).
Olfaction, as a part of chemoreception, uses water-soluble volatile odorants or gaseous chemical cues which bind to various sensory receptors, including G protein-coupled receptors (GPCRs). GPCRs, also known as seven-transmembrane (7-TM) domain receptors, are the largest and most diverse group of membrane receptors in eukaryotes, including EPNs (Elrod and Chou, 2002; Audebrand et al., 2020). Most GPCRs are 200-1000 amino acids long (Bresso et al., 2019). They are found within the lipid-protein bilayer of the cell membrane and are responsible for regulating communication between the cell and its surroundings (Fig. 1) (Schiöth and Fredriksson, 2005; Sonnabend et al., 2017; Insel et al., 2019). In addition, it has been suggested that they are present in cell organelles and inside the nucleus. There are hundreds of different GPCRs which can bind to a diverse set of ligands, including peptide hormones, neurotransmitters, neuropeptides, biogenic amines, amino acids, ions, lipid-derived mediators, peptides, proteins, and odor molecules (Gether, 2000; Schiöth and Fredriksson, 2005; Hilger et al., 2018; Sriram and Insel, 2018), and they are responsible for vision, olfaction, taste, and more (Rosenbaum et al., 2009). Nematode chemosensory GPCRs (NemChRs) are unique to nematodes, and are involved in the detection and reception of odor molecules, as their ligand binding sites are located on cell surfaces and are accessible to sensory molecules (Robertson, 2006; Krishnan et al., 2014). This sensory perception is processed downstream through different signalling pathways, leading to modulation of nematode behavior (Bargmann et al., 1993; Bargmann, 2006). Silencing of these downstream effectors inhibits nematode response, activation, and movement towards host-emitted cues (Gang et al., 2020; Wheeler et al., 2020). The GPCRs are involved in controlling the movement of EPNs towards (or within) their host (Bresso et al., 2019). Recently, Bernot et al. (2020) and Wheeler et al. (2020) have reported the involvement of NemChRs in host-seeking behavior.
Figure 1: GPCRs are found within the lipid-protein bilayer of the cell membrane and are responsible for regulating communication between the cell and its surroundings (Insel et al., 2019; Schiöth & Fredriksson, 2005). A GPCR becomes activated by an extracellular signal, such as ligand binding, resulting in an exchange of guanosine diphosphate (GDP) with guanosine triphosphate (GTP) on the Gα subunit, and a subsequent dissociation of the heterotrimer into Gα and a βγ subunit dimer. Gα and the βγ dimer then proceed to initiate downstream signaling cascades (Davies et al., 2007; Jones, 2002; Morris & Malbon, 1999).
Genome sequencing projects have made it possible for researchers to identify genes of interest in their favorite organisms. C. elegans was the first animal to have its genome sequenced, and identification of ca. 1,000 GPCRs in the C. elegans genome (Bargmann, 1998) opened a new paradigm for understanding its chemotactic behavior and related physio-biochemical reactions (Robertson, 2006). However, predicting GPCRs in a genome has not been easy because of the complexity of these genes and the inability of simple sequence similarity to predict a GPCR (Bockaert and Pin, 1999; Fredriksson et al., 2003; Wistrand et al., 2006; Suwa, 2014). Several tools have been developed for prediction of GPCRs from genome sequences, such as GPCRHMM, GPCRPred, GPCRPipe, and GPCR-Pen; however, they use different algorithms and databases in the background, resulting in inconsistent outcomes (Bhasin and Raghava, 2004; Wistrand et al., 2006; Theodoropoulou et al., 2013; Begum et al., 2020). For example, even if all the tools are trained on the human genome database, they identify different sequences as GPCRs, thus adding to the confusion (Takeda et al., 2002; Bjarnadottir et al., 2006). Therefore, researchers working on GPCRs are always seeking simpler means for stringent identification of GPCRs in a genome sequence. The genome of the entomopathogenic nematode H. bacteriophora was sequenced in 2013, and a total of 82 GPCRs were predicted in it (Bai et al., 2013). In this study, we investigated the proteomic dataset of H. bacteriophora (Bai et al., 2013) and its improved annotated version published in 2018 (McLean et al., 2018) to identify putative GPCRs. We designed a pipeline for this purpose by combining several independent transmembrane (TM) finders and GPCR predictors. We also investigated putative GPCRs in the better annotated genome of the model organism C. elegans to test the validity of our method. Further, we predicted the functions of the identified GPCRs in silico, and classified them to family and subfamily levels (Attwood and Findlay, 1994). To the best of our knowledge, this is the first pipeline of its kind for stringent identification of GPCRs from a genome sequence, and it could be useful for identifying GPCRs from other eukaryotic genome sequence assemblies.

Identification of putative GPCRs
In order to predict putative GPCRs in H. bacteriophora and C. elegans, we followed a combination of different bioinformatic approaches, including both motif-based and alignment-free methods, as well as structural similarity-based approaches. The sequence alignment methods comprised BLASTX, BLASTP (Altschul et al., 1990), and Pfam (Bateman et al., 2000). The alignment-free method included GPCRPred (Bhasin and Raghava, 2004), whereas TMHMM (Krogh et al., 2001), GPCRHMM (Wistrand et al., 2006), and GPCRPipe (Theodoropoulou et al., 2013) come under machine learning or statistical algorithm-based methods. The entire pipeline for GPCR identification was composed of three key steps, as illustrated in Fig. 2.

Length-based screening and transmembrane detection
The initial step was to filter the large number of protein sequences based on their length. The proteomic datasets of H. bacteriophora (Bai et al., 2013; McLean et al., 2018) and C. elegans (C. elegans Sequencing Consortium, 1998), containing 21,699, 15,218, and 28,447 predicted protein sequences, respectively, were filtered by sequence length to discard proteins with fewer than 200 amino acid residues. These sequences were then checked for the number of transmembrane helices. Four different transmembrane detectors were used for this purpose: (1) TMHMM 2.0 (https://services.healthtech.dtu.dk/service.php?TMHMM-2.0), a membrane protein topology prediction method based on a hidden Markov model (HMM) (Krogh et al., 2001); (2) Phobius (https://phobius.sbc.su.se/), an HMM-based signal peptide predictor that also detects transmembrane protein topology (Käll et al., 2004; Käll et al., 2007); (3) HMMTOP2 (http://www.enzim.hu/hmmtop/), an automatic server for predicting transmembrane helices and topology of proteins based on HMM (Tusnády and Simon, 1998; Tusnády and Simon, 2001); and (4) TOPCONS (https://topcons.cbr.su.se/), a membrane protein topology prediction tool based on a complex algorithm (Tsirigos et al., 2015). TMHMM2 was run using the "one line per protein" option. Phobius was run in the Normal prediction mode with the short output format selected. HMMTOP2 was run in the advanced mode with the parameters: FASTA format, Single Sequence type, Reliable prediction type, text output, and the results in one line. TOPCONS was operated with default parameters.
Figure 2: Visual representation of the overall strategy to predict and identify novel GPCRs through a bioinformatic approach. Groups of sequences are represented by rectangles, tools by rhombuses, and criteria by circles.
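The length-based screening described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`read_fasta`, `filter_by_length`) and the demo records are hypothetical, and the only value taken from the text is the 200-residue cutoff, applied here as "keep sequences with at least 200 residues".

```python
from typing import Dict, Iterable, Iterator, Tuple

MIN_LENGTH = 200  # cutoff stated in the text; proteins shorter than this are discarded


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    name, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Use the first whitespace-delimited token of the header as the identifier.
            name, chunks = (line[1:].split() or [""])[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def filter_by_length(records: Iterable[Tuple[str, str]],
                     min_length: int = MIN_LENGTH) -> Dict[str, str]:
    """Keep records whose sequence (ignoring trailing stop symbols '*') is long enough."""
    return {name: seq for name, seq in records
            if len(seq.rstrip("*")) >= min_length}


# Hypothetical demo records, not real H. bacteriophora proteins.
demo = [">p1 example", "A" * 150, "A" * 60, ">p2", "M" * 199, ">p3", "M" * 200 + "*"]
kept = filter_by_length(read_fasta(demo))
print(sorted(kept))
```

The retained sequences would then be submitted to the four transmembrane detectors; how each tool's output is parsed depends on its output format and is not shown here.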

GPCR Prediction
Sequences predicted to contain seven transmembrane helices by at least three of the four transmembrane detectors were then subjected to four different GPCR prediction tools to identify putative GPCRs. These were GPCRHMM (https://gpcrhmm.sbc.su.se/index.html), an HMM-based GPCR prediction tool (Wistrand et al., 2006); GPCRPipe (http://aias.biol.uoa.gr/GPCRpipe/search.php), which detects GPCRs using a pHMM library and an HMM for GPCRs (Theodoropoulou et al., 2013); GPCR Prediction Ensemble (GPCR-Pen) (https://gpcr.utep.edu/), which identifies GPCRs based on four different algorithms (Begum et al., 2020); and GPCRPred (https://webs.iiitd.edu.in/raghava/gpcrpred/), a support vector machine (SVM) based approach (Bhasin and Raghava, 2004). GPCRPipe detects GPCRs in two consecutive steps: it first applies a hidden Markov model and then a library of 39 Pfam profile HMMs (specific to different families of GPCR). Analyses using GPCRHMM were performed with the local scoring option turned on. We used the GPCRPipe "AND" method, as it predicts a GPCR only when it is confirmed by both methods. This selection resulted in a reduced number of GPCRs and a limited number of false positives. GPCRPred was run on its webserver with default parameters. While using GPCR-Pen, we restricted the prediction algorithms to Pfam and BLAST to get results based only on sequence similarity, and we retained those sequences which were confirmed by both of these methods.
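The "at least three of four tools" rule applied at both the transmembrane and the GPCR prediction stages amounts to simple majority voting over sets of sequence identifiers. The sketch below is an illustration under stated assumptions: the `consensus` function and the sequence identifiers are hypothetical, and each tool's output is assumed to have been parsed already into a set of identifiers.

```python
from collections import Counter
from typing import Dict, Set


def consensus(calls: Dict[str, Set[str]], min_votes: int = 3) -> Set[str]:
    """Return identifiers reported by at least `min_votes` of the tools in `calls`."""
    # Each tool contributes at most one vote per identifier because its calls are a set.
    votes = Counter(seq_id for ids in calls.values() for seq_id in ids)
    return {seq_id for seq_id, n in votes.items() if n >= min_votes}


# Hypothetical identifiers standing in for parsed tool output.
gpcr_calls = {
    "GPCRHMM": {"seq_a", "seq_b"},
    "GPCRPipe": {"seq_a", "seq_b", "seq_c"},
    "GPCR-Pen": {"seq_a"},
    "GPCRPred": {"seq_a", "seq_b", "seq_c", "seq_d"},
}
putative_gpcrs = consensus(gpcr_calls, min_votes=3)
print(sorted(putative_gpcrs))
```

Raising `min_votes` to the number of tools gives the strict intersection, while lowering it trades stringency for sensitivity.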

JOURNAL OF NEMATOLOGY
The outcomes of all the bioinformatic tools were then compiled and visualized using the Venny 2.1 software (https://bioinfogp.cnb.csic.es/tools/venny/). Venn diagrams were created to compare the numbers of seven-transmembrane sequences and predicted GPCRs identified by different combinations of tools.
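The per-tool "unique" counts read off such Venn diagrams can also be computed directly with set differences. The following sketch assumes each tool's predictions are available as a set of identifiers; the function name `unique_calls` and the example identifiers are hypothetical and do not reproduce Venny itself.

```python
from typing import Dict, Set


def unique_calls(calls: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """For each tool, return the identifiers that no other tool reported."""
    unique = {}
    for tool, ids in calls.items():
        # Union of everything reported by the remaining tools.
        others = set().union(*(v for k, v in calls.items() if k != tool))
        unique[tool] = ids - others
    return unique


# Hypothetical identifiers standing in for parsed 7-TM predictions.
tm_calls = {
    "TMHMM2": {"seq_a", "seq_b", "seq_e"},
    "Phobius": {"seq_a", "seq_b", "seq_c"},
    "HMMTOP2": {"seq_a", "seq_c", "seq_d"},
    "TOPCONS": {"seq_a", "seq_b"},
}
for tool, ids in unique_calls(tm_calls).items():
    print(tool, len(ids))
```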

Pipeline validation and family-wide classification
Apart from predicting GPCRs, GPCRPred was also used to determine the families and subfamilies of GPCRs based on the dipeptide composition of proteins. It categorizes GPCRs into class A (rhodopsin-like), B (secretin and adhesion), C (metabotropic glutamate), D (fungal pheromone receptors), E (cAMP receptors), or F (frizzled), as per International Union of Basic and Clinical Pharmacology (IUPHAR) nomenclature (http://www.guidetopharmacology.org/nomenclature.jsp; Attwood and Findlay, 1994; Kolakowski, 1994; Sadowski and Parish, 2003).
All GPCR candidates predicted by the pipeline were analyzed by another alignment-free method, GPCR-CA (http://218.65.61.89:8080/bioinfo/GPCR-CA), whose algorithm is based on cellular automaton (CA) images (Xiao et al., 2009). GPCR-CA is a two-layer predictor: the first layer identifies a query protein as GPCR or non-GPCR; if it is a GPCR, a second layer assigns it to a class in the A-F classification system, based on sequence homology and functional similarity (Attwood and Findlay, 1994).
In order to validate the pipeline, the putative GPCRs were then checked for the presence of an extracellular N terminus and an intracellular C terminus using HMMTOP2 and TMHMM2.
We screened all these GPCRs through Pfam (http://pfam.xfam.org/) (Punta et al., 2012), a database of protein families and an HMM-based (Eddy, 1998) protein classifier, to get a complete and accurate classification of protein families and domains. Apart from this, InterPro (https://www.ebi.ac.uk/interpro/) was also used to analyze GPCR sequences functionally, to classify them into families, and to predict the presence of signature domains and important sites. We revalidated all these analyses through conserved domain analysis by NCBI CDD search (https://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi).
Putative H. bacteriophora GPCRs were classified into GPCR families according to the families to which their orthologous proteins in C. elegans and Ancylostoma caninum were assigned. For this purpose, orthology was detected between these resultant sequences and other organisms with the widely used reciprocal BLAST approach in WormBase ParaSite (https://parasite.wormbase.org/index.html). This was done by using each GPCR sequence detected from H. bacteriophora as a query against the C. elegans proteome and selecting the highest scoring C. elegans protein sequence. This C. elegans sequence was then used as the query in a BLAST search against the proteome of H. bacteriophora. If the C. elegans sequence reciprocally identified the original query sequence as the highest scoring hit, the two sequences were considered an orthologous pair, which implies that they may possess similar functions and biological roles and fall into the same GPCR subfamily. H. bacteriophora GPCRs were assigned to families and subfamilies based on their orthologous GPCRs in other model organisms, such as C. elegans and A. caninum. In all BLAST comparisons, we used the BLOSUM62 scoring matrix and a cut-off threshold of 1e-1. Additionally, motif prediction of identified GPCRs was performed in MEME Suite 5.4.1 (https://meme-suite.org/meme/). Motifs were functionally annotated in HHpred (http://toolkit.tuebingen.mpg.de/hhpred) using default parameters. Functional motifs were identified in the GPCR sequences using the MOTIF search tool (https://www.genome.jp/tools/motif/). PRED-COUPLE2 (http://athina.biol.uoa.gr/bioinformatics/PRED-COUPLE2/) (Sgourakis et al., 2005) was used to predict the coupling specificity of GPCRs to the four families of G-proteins, with a stringent cutoff of 0.3 to discriminate between positive and negative predictions. Results below this limit were not considered positive predictions and were discarded.
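The reciprocal BLAST logic described above can be sketched in Python as follows. This is an illustration only: it assumes the BLAST results have already been parsed into (query, subject, E-value, bit score) tuples, the function names and all identifiers are hypothetical, and "highest scoring" is interpreted here as the highest bit score among hits passing the 1e-1 E-value cutoff.

```python
from typing import Dict, Iterable, List, Tuple

Hit = Tuple[str, str, float, float]  # (query, subject, e-value, bit score)


def best_hits(hits: Iterable[Hit], max_evalue: float = 0.1) -> Dict[str, str]:
    """Map each query to its highest-scoring subject among hits passing the cutoff."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, evalue, bitscore in hits:
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward: Iterable[Hit],
                         reverse: Iterable[Hit]) -> List[Tuple[str, str]]:
    """Return (query, subject) pairs that are each other's best hit."""
    fwd = best_hits(forward)
    rev = best_hits(reverse)
    return sorted((q, s) for q, s in fwd.items() if rev.get(s) == q)


# Hypothetical hits: H. bacteriophora queries vs. C. elegans, and the reverse search.
forward = [("hb_1", "ce_1", 1e-50, 250.0), ("hb_1", "ce_2", 1e-10, 90.0),
           ("hb_2", "ce_2", 1e-20, 120.0), ("hb_3", "ce_3", 0.5, 30.0)]
reverse = [("ce_1", "hb_1", 1e-48, 245.0), ("ce_2", "hb_1", 1e-11, 95.0),
           ("ce_2", "hb_2", 1e-8, 80.0), ("ce_3", "hb_3", 1e-5, 60.0)]
orthologs = reciprocal_best_hits(forward, reverse)
print(orthologs)
```

In this toy example only one pair survives: the second query's best C. elegans hit points back to a different protein, and the third query's forward hit fails the E-value cutoff.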

Identification of putative GPCRs
Out of the total 21,699 predicted proteins of H. bacteriophora, 21 sequences were identified as GPCRs using the pipeline presented in Fig. 2. At the initial stage of identification, 7,613 sequences with more than 200 amino acid residues were short-listed. Four different transmembrane predictors identified varying numbers of sequences containing transmembrane domains. TMHMM2, a widely used transmembrane predictor, identified 97 sequences containing 7-TM helices out of the 7,613 sequences, of which 14 were uniquely identified by TMHMM2 (Fig. 3). Another tool, Phobius, detected 139 7-TM sequences, of which 42 were unique (Fig. 3). In addition, HMMTOP2 and TOPCONS identified 153 and 123 7-TM sequences, respectively. The 69 heptahelical transmembrane sequences confirmed by at least three of the four detectors were passed through four GPCR prediction tools to identify the GPCRs among them. GPCRHMM, GPCRPipe, GPCR-Pen, and GPCRPred identified 22, 21, 15, and 62 sequences as GPCRs, respectively (Fig. 3B). Interestingly, GPCRHMM, GPCRPipe, and GPCR-Pen did not identify any unique GPCR sequence, but GPCRPred identified 36 unique GPCRs not predicted by any other tool. In summary, 21 sequences were considered putative GPCRs, as they were identified by at least three of the four tools used for GPCR prediction and further confirmed by detailed analysis of the sequences (Fig. 3B, Tables 1-3, supplementary Tables 2-4).

Using the same pipeline, 27 GPCR sequences were identified from the proteomic dataset of H. bacteriophora published by McLean and coworkers in 2018. Out of these sequences, 13 were found to be similar to the GPCRs retrieved from the earlier version of the annotation, published in 2013 (Table 4).
A total of 1,252 GPCRs (supplementary Table 5) were identified out of the 28,447 predicted proteins of C. elegans using the pipeline proposed in this study (Fig. 2). Initially, 21,372 sequences with >200 amino acid residues were screened for 7-TM domains. Of the four transmembrane predictors, Phobius detected the highest number (1,782) of 7-TM sequences, with 116 unique ones. TMHMM2 identified 1,566 sequences containing 7-TM helices, of which 80 were unique. In addition, 1,710 sequences were predicted by HMMTOP2, with the highest number (178) of unique sequences. TOPCONS identified only 791 proteins with 7-TM helices, and <1% of those were unique. A total of 1,404 sequences were predicted to have 7-TM helices by at least three out of four tools (Fig. 3C). The 1,404 resultant sequences were screened using four GPCR predictors. GPCRHMM, GPCRPipe, GPCR-Pen, and GPCRPred identified 1,255, 1,393, 258, and 1,388 sequences as GPCRs, respectively (Fig. 3D). Interestingly, GPCRHMM and GPCR-Pen could not identify any unique GPCR sequence. Finally, 1,252 sequences were considered putative GPCRs, as they were identified by at least three of the four GPCR detectors (Fig. 3D, supplementary Table 5).

Further characterization of predicted GPCRs
All the sequences identified as GPCRs above were validated by the GPCR-CA tool, as it uses cellular automaton (CA) images to reveal the features hidden in complex protein sequences. It designated all these GPCRs as Class A rhodopsin-like GPCRs, except Hba_17528, which was identified as a Class D fungal pheromone GPCR (Table 1). Further, the predicted GPCRs were classified based on function and protein family, and four chemosensory GPCRs (Hba_07805, Hba_18427, Hba_18743, and Hba_17528), ten 7-transmembrane receptors under the rhodopsin family, one rhodopsin-like GPCR with transmembrane domain, one frizzled type, and one secretin type GPCR were identified (Table 1, supplementary Table 1). In addition, reciprocal BLAST was used to find the orthologous sequences of the predicted GPCRs in the proteomic datasets of closely related model organisms. This approach also validated the pipeline and confirmed that all the identified protein sequences were GPCRs. Sequence similarities identified a frizzled type (Hba_19080), a secretin type (Hba_20566), and the other 19 as rhodopsin types of GPCR (Table 1). The GPCRPred tool classified all the 21 shortlisted sequences as Class A rhodopsin-like GPCRs and further classified them into three different subfamilies: peptide (15 sequences), biogenic amine (3 sequences), and lysosphingolipid (1 sequence) (Table 1). A search against the InterPro database confirmed the families of all the 21 proteins as G protein-coupled receptors except Hba_17528. All the GPCRs were suggested to be involved in the G protein-coupled receptor signalling pathway except Hba_19080, which was annotated to have a role in cell surface receptor signalling. Additionally, Hba_18878 was suggested to be involved in the neuropeptide signalling pathway, and Hba_18906 in pheromone responsiveness. Most of them are rhodopsin-like GPCRs, except Hba_19080 (frizzled/secreted, frizzled-related protein) and Hba_20566 (secretin-like). All these GPCRs were found to be integral components of the cell membrane by gene ontology (GO) analysis, except Hba_18906 (Table 2).
Additionally, NCBI conserved domain analysis revealed that four identified proteins were members of the class A seven-transmembrane GPCRs and belonged to FMRFamide (Phe-Met-Arg-Phe)-like receptors and related proteins. Eight of the proteins were of the rhodopsin receptor-like class A family of the seven-transmembrane GPCR superfamily, which constitutes about 90% of all GPCRs. They include light-sensitive rhodopsin and receptors for biogenic amines, lipids, nucleotides, odorants, peptide hormones, and various other ligands. Six of the proteins were broadly classified under the 7tm_GPCRs superfamily. Among these, Hba_18743 was a serpentine type 7TM GPCR chemoreceptor under the Srsx family, the only family among the various superfamilies of chemoreceptors. Another serpentine type of chemoreceptor GPCR (Hba_17528) was found under the srx family, which is a part of the Srg superfamily of chemoreceptors. Interestingly, Hba_20566 was a pigment-dispersing factor receptor (PDFR), a member of the B1 subfamily of class B seven-transmembrane GPCRs, also referred to as the secretin-like receptor family. Hba_18203 was found to be an FMRFamide receptor and a member of the class A family of seven-transmembrane G protein-coupled receptors. Hba_09978 was a cholecystokinin receptor and came under the class A family of seven-transmembrane GPCRs. This group represents four GPCRs that are members of the RFamide receptor family, including cholecystokinin receptors (CCK-AR and CCK-BR), orexin receptors (OXR), neuropeptide FF receptors (NPFFR), and pyroglutamylated RFamide peptide receptors (QRFPR). Hba_08130 was an amine receptor of the class A family of GPCRs, which include adrenoceptors, 5-HT (serotonin) receptors, muscarinic cholinergic receptors, dopamine receptors, histamine receptors, and trace amine receptors (supplementary Table 2).
G-protein-coupled receptors (GPCRs) in Heterorhabditis bacteriophora: Kundu et al.
PRED-COUPLE2 analysis showed possible interaction of the 21 GPCRs with the different families (Gi/o, Gq/11, Gs, and G12/13) of G-proteins (Table 1). Among these, Gi/o is the most abundant kind of G-protein, which can bind to all the identified GPCRs. A few GPCRs have coupling specificity only to Gi/o. Additionally, there are promiscuous GPCRs, except Hba_14891, Hba_19161, Hba_20566, and Hba_17528, which can couple to members of … 1).

Discussion
Entomopathogenic nematodes such as Heterorhabditis and Steinernema depend on chemoreception to locate and infect their hosts in soil, and are used as biocontrol agents for insect pests of crops (O'Halloran and Burnell, 2003; Hallem et al., 2007; Hallem et al., 2011). In order to perform molecular investigations on chemosensory GPCRs in Heterorhabditis nematodes, identification of the GPCRs is the first and most crucial step. We developed a highly stringent bioinformatic pipeline and identified 21 GPCRs in H. bacteriophora. These 21 sequences are fewer than the 82 GPCRs identified earlier in the genome of H. bacteriophora (Bai et al., 2013), which may be attributed to differences in the algorithms and analysis tools used. Bai et al. (2013) searched for the presence of GPCR domains in protein sequences based on the HMMER algorithm using Pfam with an e-value cut-off of 1e-4. However, the primary prerequisite for a protein to be a GPCR is the presence of seven stretches of 25-35 consecutive amino acid residues (Fredriksson et al., 2003; Kroeze et al., 2003; Schiöth and Fredriksson, 2005; Rosenbaum et al., 2009; Trzaskowski et al., 2012). To become thermodynamically stable, a transmembrane protein should be folded and packed in such a way that the average lengths of its surface loops and helices remain 19 and 26 residues, respectively (Meruelo et al., 2012). Therefore, the minimum length of a protein containing a 7-transmembrane domain should be about 296 amino acid residues. Accordingly, a sequence length cutoff of 200 amino acid residues was used in this study. Langeland et al. (2021) created an integrative database of nematode chemoreceptors called NemChR-DB to facilitate the analysis of NemChRs (http://ohalloranlab.net/nemchr-db). In this database, 53 proteins from H. bacteriophora were annotated as NemChRs, compared to four NemChRs (Hba_07805, Hba_18427, Hba_18743, and Hba_17528) identified in our study. These four proteins are also listed in NemChR-DB; a total of only seven GPCRs are identified by both our study and Langeland et al. (2021). The mismatch in the numbers of GPCRs identified may be because the present study used highly stringent parameters for identifying a protein as a GPCR. Although Langeland et al. (2021) state that 7-TM helices are considered an ideal characteristic of any GPCR (http://ohalloranlab.net/nemchr-db/methods.html), they have also included proteins with 5, 6, 8, and 9 TM domains as GPCRs in NemChRs. Their approach is inclusive, whereas we followed a highly reductionist approach. In our study, in addition to the 7-TM criterion, four different tools were used to search for GPCRs, resulting in a smaller set of sequences being identified as GPCRs compared to the set identified by Langeland et al. Further, to predict NemChRs, Langeland et al. (2021) and Wheeler et al. (2020) used "hmmsearch" to compare proteins against a database of GPCR Pfam hidden Markov models (HMMs). To revalidate our findings, we also screened the 21 GPCRs through Pfam search, and all were confirmed to contain GPCR domains. Bernot et al. (2020) investigated the expression of GPCRs of Ancylostoma ceylanicum across various life stages, and GPCRs were identified based on annotations of the sequences resulting from the transcriptomes. It may be suggested that the GPCRs identified using our methodology are accurate and precise due to higher stringency in the filtration criteria compared to these studies. Our pipeline identified 1,252 C. elegans proteins as GPCRs, a number very close to the 1,100 and 1,341 GPCRs reported previously (Bargmann, 1998; Thomas and Robertson, 2008), thus supporting the validity of our method.
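The figure of 296 residues quoted above for a minimal 7-TM protein can be reproduced with simple arithmetic. The worked example below is a sketch under an assumption not stated explicitly in the text: that the seven helices (average length 26 residues) are joined by six connecting loops (average length 19 residues), with the N- and C-terminal tails not counted.

```latex
\[
L_{\min} \approx \underbrace{7 \times 26}_{\text{helices}} + \underbrace{6 \times 19}_{\text{connecting loops}}
        = 182 + 114 = 296 \ \text{amino acid residues}
\]
```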
Further, these filtered sequences were passed through different transmembrane detectors; α-helical transmembrane proteins are the most important class of membrane proteins and constitute almost 20-30% of all the proteins encoded in a genome (Wallin and von Heijne, 1998; Krogh et al., 2001). From the retained sequences, TMHMM2 detected 97 sequences containing seven transmembrane helices, while Phobius, HMMTOP2, and TOPCONS identified 139, 153, and 123 sequences with the same property, respectively. More than ten in silico programs are publicly available to identify transmembrane domains. A comparative study showed that TMHMM2 and HMMTOP2 were more accurate than other programs, and Phobius was reported to perform comparably (Cuthbertson et al., 2005). TMHMM2 is highly specific and sensitive, with >99% accuracy in differentiating membrane and soluble proteins. Additionally, it is highly accurate (97.5%) in predicting helical regions of a transmembrane protein (Krogh et al., 2001). Like TMHMM2, the other three transmembrane detectors are highly precise in predicting helical regions. If we consider only membrane protein detection with no signal peptide, the performance of TOPCONS is around 80%. Though it follows the hidden Markov model for transmembrane detection and structure prediction, it relies on a Viterbi-like algorithm to score the final topology model (Tsirigos et al., 2015). Phobius, which is based on HMM, models the different sequence regions of a signal peptide and the different regions of a transmembrane protein in a series of interconnected states, with an accuracy of 63.6% when predicting only TM regions (Käll et al., 2004). Errors may occur in ORF identification or transmembrane prediction by a single bioinformatic tool. Though all the tools use similar kinds of HMM architecture, the methods and pathways involved in the prediction process are quite different. Their accuracies also differ, due to variation in the cross-validation methods and datasets used in the background. The use of four different transmembrane detectors therefore provided more reliable prediction and reduced the chances of generating false positives or negatives, thereby strengthening the GPCR detection pipeline. Among the queries, 69 proteins were considered seven-transmembrane sequences, as they were confirmed by at least three out of four transmembrane detectors.
Among these 7-TM sequences, 22 were identified as GPCRs by GPCRHMM, a hidden Markov model-based GPCR recognition software which identifies TM topology-related features. It captures the variation in amino acid composition and topological segment lengths between GPCR families. It has a lower error rate in identifying GPCRs than other HMM-based GPCR predictors, including Pfam. It has shown higher selectivity and sensitivity than profile HMMs and generic transmembrane detectors on sets of known GPCRs and non-GPCRs (Wistrand et al., 2006). As GPCRHMM and GPCRPipe use a similar type of algorithm to detect GPCRs, they identified nearly the same numbers of GPCRs. The "AND" method of GPCRPipe has an accuracy of 97% and sensitivity and specificity of around 91% and 100%, respectively. These values are higher than those of any other GPCR detector (Theodoropoulou et al., 2013). On the other hand, GPCR-Pen combines sequence similarities (BLAST), common sequence motif profiles (Pfam), transmembrane structure (GPCRTm), and dipeptide composition (GPCRPred) (Begum et al., 2020). It predicted only 15 GPCR sequences, however, as we restricted our search to the first two algorithms. When a completely different GPCR prediction server (GPCRPred) was employed for the same purpose, it recognized a much higher number of sequences (62) as GPCRs compared to the other three tools. It is a support vector machine based on dipeptide composition (Bhasin and Raghava, 2004). The completely different search algorithm and training database resulted in a higher number of sequences, almost 3-4 times the number retrieved by the remaining three programs. The GPCR recognition accuracy of GPCRPred is up to 99.5% using 5-fold cross-validation. All the resultant sequences were reconfirmed by screening through GPCR-CA, which supports the stringency of the pipeline, as GPCR-CA depends on a completely different algorithm to predict and classify GPCRs that was not used at any earlier stage of the pipeline. It utilizes CA images to reveal the features hidden in long and complex protein sequences. The gray-level co-occurrence matrix factors extracted from these images are used to represent the samples of proteins through their pseudo amino acid composition. It designated all these GPCRs as Class A rhodopsin-like GPCRs. Likewise, GPCRPred also classified all the retrieved GPCRs as Class A rhodopsin types. GPCRPred can classify GPCRs into five major classes or families with an overall Matthews correlation coefficient (MCC) and accuracy of 0.81 and 97.5%, respectively (Bhasin and Raghava, 2004). It has been suggested that despite having low sequence similarity and diversified signal molecules, GPCRs involved in chemoreception might have originated from the rhodopsin family of GPCRs (Nordström et al., 2011). This rhodopsin family is the most abundant and diverse among all the GPCR families. They also have a unique signal transduction mechanism (Rosenbaum et al., 2009). The presence of an extracellular N-terminus domain, an intracellular C-terminus domain, and seven serial transmembrane hydrophobic helices joined by intracellular and extracellular loops are typical properties of GPCRs (Brody and Cravchik, 2000; Kroeze et al., 2003; Rosenbaum et al., 2009; Hanlon and Andrew, 2015). Except for Hba_18906, all the retrieved sequences exhibited these properties, which strengthens our pipeline and the selection.
The same GPCR mining methodology was applied to the C. elegans proteomic dataset. Earlier, Troemel and coworkers in 1995 found a large set of GPCRs (~1,341) in the C. elegans genome (Troemel et al., 1995). Bargmann (1998) reported around 1,100 GPCRs in C. elegans. Based on manual curation and sequence comparisons, Thomas and Robertson in 2008 identified nearly 1,300 genes encoding GPCRs in this nematode. Most of these are NemChRs, which are believed to be involved in sensing the external environment in the absence of visual and auditory systems in C. elegans. The numbers of GPCRs reported in these studies are similar to the 1,252 GPCRs identified in the C. elegans genome in our study, which supports the accuracy and precision of the pipeline used in the present study.
The involvement of GPCRs in a wide array of physiological and pathological processes (Dryer and Berghard, 1999; Mombaerts, 1999; Schiöth and Fredriksson, 2005; Nei et al., 2008), and the presence of their ligand binding sites on cell surfaces, have made them the most suitable and accessible drug targets, for example, for angiotensin receptor blockers (ARBs) used against hypertension (Ghosh et al., 2015; Odoemelam et al., 2020; Alhosaini et al., 2021). Several GPCRs have unknown ligand binding properties and are known as orphan GPCRs. The GPCR-ligand interaction and its downstream effect depend on the interaction of the GPCR under study with members of a specific G-protein subfamily. Therefore, predicting the coupling specificity of orphan GPCRs to G-protein subfamilies is essential for finding potential drug targets through heterologous expression studies (Wess, 1998). However, GPCRs with low sequence similarity may couple to members of the same G-protein subfamily, while members of the same GPCR subfamily often couple to members of distinct G-protein subfamilies (Wong, 2003). As promiscuous GPCRs are found to couple with more than one G-protein subfamily, it is evident that coupling is a multidimensional rather than a one-to-one function (Hermans, 2003; Sgourakis et al., 2005).
The NemChRs identified in this study must be functionally validated for their roles in chemoreception in the EPN H. bacteriophora. NemChRs are generally expressed in amphid sensory neurons (Vidal et al., 2018) and are expected to facilitate host recognition, seeking, and detection (Bernot et al., 2020; Wheeler et al., 2020). GPCRs, like NemChRs, may be responsible for attraction and movement towards a host, and may be manipulated to modulate nematode behavior in soil, as shown in the case of the root-knot nematode M. incognita (Bresso et al., 2019). Like NemChRs, these GPCRs may also provide ideal drug targets for nematicides and anthelminthics (Robertson, 2006; Krishnan et al., 2014). Lastly, the bioinformatic pipeline developed in this study may be used to identify GPCRs from the genome of any organism, including nematodes.
HMMTOP2 predicted 153 sequences, of which 59 were unique. TOPCONS identified only 123 proteins containing 7-TM helices, of which only 25% were uniquely identified. A total of 69 sequences were predicted to have 7-TM helices by at least three out of the four transmembrane predictors used for this study (Fig. 3A, supplementary

Figure 4: The conserved motif distribution across the 21 GPCRs. Each categorized motif logo generated by MEME is displayed in a differently colored box. The legend (on the right-hand side) depicts the protein sequence of the corresponding motifs. Motifs were numbered serially according to their frequency of occurrence in the MEME bioinformatics tool.

Table 1. The characterization and classification of GPCR sequences identified through a bioinformatics pipeline using various annotation tools and methods, and their probable coupling specificity with different G-proteins.

Table 2. The Gene Ontology annotations of identified GPCR sequences of H. bacteriophora.

Table 3. Similarity of motifs identified by MEME analysis in GPCRs with the known protein domains as analyzed by HHPred.

G-protein-coupled receptors (GPCRs) in Heterorhabditis bacteriophora: Kundu et al.

JOURNAL OF NEMATOLOGY

Table 4: Similarity of GPCRs identified from two different versions of annotations of H. bacteriophora.

Table 2: GPCR sequences (21) fetched from three out of four GPCR predictors used in this study.